Description
Inference
Prediction
In your groups brainstorm a real-world question you might want to answer using data.
Think about how you might go about using data to answer that question.
Statistical Inference is using information/data from a sample to draw conclusions about a population
\(\mathbf{X} = \left(X_1, X_2, ... X_n \right)\) is a sample of data from a distribution \(P_{\theta}\). We want to use \(\mathbf{X}\) to learn about \(P_{\theta}\) since we can’t directly observe \(P_{\theta}\).
Populations can be thought of as:
Think about the heights of all people in the USA named Michael.
Group of Existing People: All 3.28M Michaels
DGP for Michaels: the theoretical process that creates Michaels
I claim that I’m faster at crosswords than you. We both do one crossword:
my time: 25m 05s
your time: 25m 23s
Am I right? What would it take to convince you I’m right?
What do I really mean when I say that I’m faster at crosswords than you?
Statistics: functions of the data that summarize the data, \(T(\mathbf{X})\).
Population Statistic: \(T(\mathbf{X})\); where \(\mathbf{X}\) is the random variable (e.g. mean height of people named Michael)
Sample Statistic: \(T(\mathbf{x})\); where \(\mathbf{x}\) is a realized sample of \(\mathbf{X}\) (e.g. mean of 100 randomly sampled heights of people named Michael)
Sample Mean: \(\frac{1}{n} \sum_{i=1}^n x_i\)
(or: 67th percentile, min, max, variance, z-statistic…)
When choosing a statistic, you’re implicitly agreeing that two samples are the same if \(T(\mathbf{x}) = T(\mathbf{y})\)
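A statistic really is just a function applied to a sample. As a minimal sketch (the ten heights below are made-up values for illustration):

```python
# A statistic T(x) is a function of the sample that summarizes it.
# Hypothetical sample: 10 heights (inches) of people named Michael.
import numpy as np

x = np.array([68.1, 70.4, 69.0, 71.2, 67.5, 72.3, 68.8, 70.0, 69.5, 66.9])

print(np.mean(x))            # sample mean -> 69.37
print(np.percentile(x, 67))  # 67th percentile
print(x.min(), x.max())      # min, max
print(np.var(x))             # sample variance
```

Each `print` is a different choice of \(T\); two samples with different raw values can still agree on any one of these summaries.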
I claim that my mean crossword time is faster than yours:
\[ \mu_{me} > \mu_{you} \]
Note: here, let’s think of our mean times as the DGP that generates our observed times:
\[ \text{Chelsea}_i \sim N \left (\mu_{me}, \sigma^2_{me} \right) \\ \text{You}_i \sim N \left (\mu_{you}, \sigma^2_{you} \right) \]
But I can’t possibly observe \(\mu_{me}\) and \(\mu_{you}\)…
If we were doing description, the sample mean alone would accomplish our goal.
But if doing inference, we want to generalize. We don’t want to know about \(\bar{x}\), we want to know about \(\mu\).
Note: we often use Greek letters to denote population statistics and Roman letters for sample statistics.
New Goal: Formalize a process to get from Sample statistics to Population statistics
e.g. what can I learn about the mean height of college students, \(\mu\), from a random sample mean, \(\bar{x}\)?
via Peter Tennant at https://x.com/PWGTennant/status/1164084443742691328
Estimand: a target quantity to be estimated
Estimator: a function, \(W(\mathbf{x})\) that is a recipe about how to get an estimate from a sample
Estimate: a realized value of \(W(\mathbf{x})\) applied to an actual sample, \(\mathbf{x}\)
Sometimes, finding an estimator is intuitive (e.g. using sample mean to estimate population mean) but other times it’s not.
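To keep the three terms straight, here is a minimal sketch (the crossword times and the estimator name `W` are made up for illustration):

```python
# Estimand:  the population mean mu (unknown, never directly observed).
# Estimator: W, a recipe that can be applied to any sample.
# Estimate:  the number W returns for one realized sample x.

def W(sample):
    """Estimator: the sample mean as a recipe for estimating mu."""
    return sum(sample) / len(sample)

x = [25.1, 24.8, 25.4, 25.0]  # hypothetical realized sample (crossword times, minutes)
estimate = W(x)               # a single realized number
print(estimate)
```

The estimand is a fixed (unknown) quantity, the estimator is a function, and the estimate is a number; only the last two are things we can actually compute.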
Method of Moments
Maximum Likelihood Estimators
Set the first \(k\) sample moments equal to the first \(k\) population moments, and solve.
\[ \underbrace{m_1}_\text{1st sample moment} = \overbrace{\mu'_1}^\text{1st population moment} \\ m_2 = \mu'_2 \\ \vdots \\ m_k = \mu'_k \\ \]
Moments of a distribution are expectations.
\[ \mu'_n = \mathbb{E}X^n \]
Central Moments replace \(X\) with the mean centered value \(X-\mu\).
\[ \mu_n = \mathbb{E}(X-\mu)^n \]
Remember:
\(p^{th}\) sample moment: \(\frac{1}{n} \sum_{i=1}^n X_i^p\)
\(p^{th}\) population moment: \(\mathbb{E}(X^p)\)
Let’s say \(X \sim \mathcal{N}(\mu, \sigma^2)\), \(k = 2\)
first moment: \(\frac{1}{n} \sum_{i=1}^n x_i = \mathbb{E}(X) = \mu\), so \(\hat{\mu} = \bar{x}\)
second moment: \(\frac{1}{n} \sum_{i=1}^n x_i^2 = \mathbb{E}(X^2) = \mu^2 + \sigma^2\)
substituting \(\hat{\mu} = \bar{x}\): \(\frac{1}{n} \sum_{i=1}^n x_i^2 = \bar{x}^2 + \hat{\sigma}^2\)
\(\hat{\sigma}^2 = \left [\frac{1}{n} \sum_{i=1}^n x_i^2 \right] - \bar{x}^2 = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2\)
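The two-moment system above can be checked numerically. A minimal sketch, assuming simulated data with true \(\mu = 5\) and \(\sigma = 2\):

```python
# Method of moments for a Normal(mu, sigma^2) sample:
# equate the first two sample moments to the population moments and solve.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100_000)  # assumed true mu=5, sigma=2

m1 = x.mean()          # 1st sample moment = E(X) = mu
m2 = (x**2).mean()     # 2nd sample moment = E(X^2) = mu^2 + sigma^2

mu_hat = m1                     # solve the first equation
sigma2_hat = m2 - mu_hat**2     # solve the second equation

# same answer as the mean-centered form:
assert np.isclose(sigma2_hat, ((x - x.mean())**2).mean())

print(mu_hat, sigma2_hat)  # close to 5 and 4
```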
\[ \text{arg}\,\max\limits_{\theta} \mathcal{L}(\theta|x) \]
The estimate of \(\theta\) is the one that maximizes the likelihood of the data, \(x\).
\[
p(x | \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x- \mu)^2}{2\sigma^2}}
\]
Where \(\theta = (\mu, \sigma)\). We want to choose the value of \(\theta\) that maximizes the likelihood of the data, \(x\).
For a single data point the value of the likelihood function, \(L\left( \theta | x \right)\) is:
\[ \mathcal{L} \left( \theta | x\right) = p(x| \theta) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x- \mu)^2}{2\sigma^2}} \]
If data points in a sample are independent, the likelihood value for all data points is simply the product of their individual likelihood values, since \(p(A,B) = p(A)*p(B) \text{ iff } A \mathrel{\unicode{x2AEB}} B\).
\[
\mathcal{L}\left(\theta | \mathbf{x} \right) = p(\mathbf{x} |\theta) = \prod_{i=1}^n p(x_i | \theta) = \prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x_i- \mu)^2}{2\sigma^2}}
\]
The higher the likelihood of our data, the more evidence that a particular \(\theta\) is a good fit for the data.
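To make this concrete, we can compare the likelihood of the same sample under two candidate values of \(\theta\). A minimal sketch with made-up data:

```python
# Joint likelihood of an independent sample = product of individual densities.
import numpy as np

def normal_pdf(x, mu, sigma2):
    """Normal density, the p(x | theta) from above."""
    return np.exp(-(x - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x = np.array([4.8, 5.1, 5.3, 4.9])  # hypothetical data clustered near 5

L_good = np.prod(normal_pdf(x, mu=5.0, sigma2=1.0))  # theta close to the data
L_bad  = np.prod(normal_pdf(x, mu=0.0, sigma2=1.0))  # theta far from the data

print(L_good, L_bad)
```

The likelihood under \(\mu = 5\) is vastly larger than under \(\mu = 0\), which is exactly the sense in which a high likelihood marks \(\theta\) as a good fit.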
\[ \text{arg max}_{\theta} \left[ \prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x_i- \mu)^2}{2\sigma^2}} \right] \]
To maximize, we take the derivative with respect to \(\theta\) and set it equal to zero.
But…
…taking the derivative of a product is hard, so we use the log likelihood.
\[ \ell\left(\theta | \mathbf{x} \right) = \log\left(\mathcal{L}\left(\theta | \mathbf{x} \right)\right) = \\ -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log (\sigma^2) -\frac{1}{2 \sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \]
Note: \(\log()\) is a monotonically increasing function, so the \(\theta\) that maximizes \(\ell\left(\theta | \mathbf{x} \right)\) also maximizes \(\mathcal{L}\left(\theta | \mathbf{x} \right)\)
\[ \ell\left(\theta | \mathbf{x} \right) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log (\sigma^2) -\frac{1}{2 \sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \]
Example with normal distribution:
\(\hat{\mu} : \frac{\partial}{\partial \mu} \ell(\theta | x) = 0\)
\(\hat{\sigma} : \frac{\partial}{\partial \sigma} \ell(\theta | x) = 0\)
Solution for Normal Distribution:
\(\hat{\mu} = \frac{1}{n} \sum_{i=1}^nx_i\)
\(\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n(x_i - \hat{\mu})^2\)
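The closed-form solutions can be verified by maximizing the log likelihood numerically. A sketch using `scipy.optimize.minimize` on the negative log likelihood, with simulated data (assumed true \(\mu = 3\), \(\sigma = 1.5\)):

```python
# Check the closed-form normal MLEs against a numerical maximizer.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=1.5, size=5_000)
n = len(x)

def neg_log_lik(params):
    """Negative of the log likelihood derived above (minimize = maximize ell)."""
    mu, log_sigma2 = params            # optimize log(sigma^2) so sigma^2 stays positive
    sigma2 = np.exp(log_sigma2)
    return (0.5 * n * np.log(2 * np.pi) + 0.5 * n * np.log(sigma2)
            + ((x - mu)**2).sum() / (2 * sigma2))

res = minimize(neg_log_lik, x0=[0.0, 0.0])

mu_hat = x.mean()                      # closed form: sample mean
sigma2_hat = ((x - mu_hat)**2).mean()  # closed form: 1/n sum (x_i - mu_hat)^2

print(res.x[0], mu_hat)                # numerical vs analytic mu
print(np.exp(res.x[1]), sigma2_hat)    # numerical vs analytic sigma^2
```

The numerical optimum matches the analytic solutions, which is a useful sanity check whenever a model has no closed-form MLE and you must optimize directly.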